This paper presents an innovative technique to explore the effect on energy 
consumption of an extensive number of the optimisations a compiler can perform. 
We evaluate a set of ten carefully selected benchmarks for five different embedded 
platforms. 

A fractional factorial design is used to systematically explore the large 
optimisation space (2^82 possible combinations), whilst still accurately determining 
the effects of optimisations and optimisation combinations. Hardware power 
measurements on each platform are taken to ensure all architectural effects on 
the energy consumption are captured. 
In the majority of cases, execution time and energy consumption are highly 
correlated. However, predicting the effect a particular optimisation may have 
is non-trivial due to its interactions with other optimisations. This validates 
long-standing community beliefs, but for the first time provides concrete evidence of 
the effect and its magnitude. 
A further conclusion of this study is that the structure of the benchmark has a larger 
effect than the hardware architecture on whether an optimisation will be effective, 
and that no single optimisation is universally beneficial for execution time or 
energy consumption. 

Keywords: Compilers; Energy optimisation; Optimisation selection; Fractional factorial 
design; Energy consumption 



1. INTRODUCTION 

Energy consumption is rapidly becoming one of 
the most important design constraints when writing 
software for embedded platforms. In the hardware 
space there are many features, such as clock gating 
and dynamic frequency and voltage scaling, to reduce 
the power consumption of electronic devices. However, 
inefficient software can negate any gains from the 
hardware, so the combination of software and hardware 
must be considered together when exploring energy 
usage. This study focuses on processors for embedded 
platforms, because energy efficiency is particularly 
important for many of their target applications. 

Optimising the software for low energy consumption 
is particularly important when adhering to a strict power 
budget. This is the case in many deeply embedded 
systems. In these devices the processor is a significant 
consumer of energy: a previous study characterised 
the CPU power usage of a handheld device to be 
between 20 and 40% of the total system power [1]. A 
further study, based on 45 nm technology data from [2], 
calculated the power dissipation of the processors in 
a 64-core network on chip to be 40% of the entire 
system [3]. This category was the largest, ahead of 
memory, network, and I/O. 

Compiler optimisations have the potential for energy 
savings with no changes to existing hardware or 
software: just tweaking the compiler's parameters can 
have a large effect on the energy consumption [4]. This 
relationship is complex, with the program, processor 
architecture and specific compiler options interacting 
together. Furthermore, different optimisation passes 
interact with each other, so an option's efficacy 
cannot be tested in isolation. For example, inlining 
a function may mean that more effective common 
subexpression elimination can be performed, increasing 
the performance more than either option individually. 
Many approaches have attempted to solve this problem, 
using techniques such as statistical methods [5], genetic 
algorithms [6] and iterative compilation [7]. All of these 
studies conclude that performance can be increased by 
choosing the correct set of optimisations, but exploring 
the space to find this set is challenging. 

Processor          Board name        RAM        Core clock  Other
ARM Cortex-M0      STM32F0DISCOVERY  8KB        48 MHz      64KB Flash
ARM Cortex-M3      STM32VLDISCOVERY  8KB        24 MHz      128KB Flash
ARM Cortex-A8      BeagleBone        256MB      500 MHz     VFP/NEON, superscalar
Adapteva Epiphany  EMEK3             32KB/core  400 MHz     FPU, superscalar, 16 core NoC
XMOS L1            XK1               64KB       100 MHz     4x100MHz hardware threads

TABLE 1. The platforms explored in this paper along with some relevant details. 

FIGURE 1. The hardware and software setup used to take 
the measurements. 

The following section covers the overall aims and 
hypotheses we wish to address in this paper. Then, 
the related work is discussed. Following this section, 
our approach to the problem of benchmark selection 
and compiler flags is given. Then, the initial high-level 
results are presented with discussion of the first two 
hypotheses. After this, there is a short introduction 
to fractional factorial design, followed by the results 
obtained using this technique. Then, case studies of 
the most effective optimisations, and of the interactions 
between optimisations, are given. Finally, concluding 
remarks to the application developer and the compiler 
writer are made. 

2. OVERVIEW OF THIS WORK 

The overall aim of this work is to identify compiler 
optimisations which are effective at reducing a 
benchmark's energy consumption. This is accomplished 
by using an advanced analysis to account for 
interactions between the optimisations, without having 
to enumerate all combinations of optimisations. This 
analysis is performed for multiple benchmarks and 
platforms, allowing general conclusions to be drawn 
about how the optimisations affect energy consumption. 
We investigate the following hypotheses: 

1. The time and energy required for a computation 
are correlated. 



2. There exists a set of compiler optimisations 
that gives a lower energy consumption than the 
predefined optimisation levels. 

3. It is possible to search the compiler optimisation 
space in an efficient and systematic manner, to 
assign each optimisation an overall effectiveness. 

4. There is no universally good optimisation across 
multiple benchmarks and platforms. 

We will evaluate the validity of these hypotheses 
by performing a series of practical experiments which 
target the set of optimisations enabled at various 
optimisation levels of a real compiler. This allows the 
optimisations to be measured for their effect on energy. 
Three types of experiment are performed in this study: 

High-level. Each optimisation level (predefined set of 
optimisations) is tested for each benchmark and 
platform. 

Fractional factorial design. A fractional factorial 
design is used to find the effectiveness of each 
optimisation flag defined at the optimisation level. 
This experiment is repeated for each optimisation 
level, and for each benchmark and platform 
combination. 

Case study. Two case studies are performed. The 
first looks at the most effective optimisation flags 
across benchmarks and platforms. The second 
explores the interactions between optimisations by 
exhaustively applying a small set of optimisations. 

All the energy measurements in this paper are taken 
using physical measurement circuitry attached to the 
processors. This avoids the use of models which could 
be inaccurate, or modelling synthetic processors with no 
real world counterpart. A diagram of the software and 
hardware setup is shown in Fig. 1. By using commonly 
available platforms and processors along with some 
more novel architectures, the results are applicable in 
general while still providing insight into how different 
types of architectures perform. There are five platforms 
examined in this paper, as shown in Tab. 1. 

This work makes the following contributions: 

• The use of fractional factorial design to analyse a 
previously intractable optimisation space of GCC's 
optimisation options. 

• Analysis of relative importance of each optimisation 
across multiple benchmarks and platforms. 

• The answers to the previously given hypotheses. 

• Commentary on how these techniques and results 
can be used by application developers and compiler 
writers. 

3. RELATED WORK 
3.1. Compilers & Energy 

To date there has been very little work extensively 
exploring the effect that different compiler optimisations 
have on energy consumption. However, there have been 
many studies that look at the effect of optimisations on 
execution time [5], and several studies suggesting that 
execution time can be used as a proxy for energy usage. 

The topic of performance and energy being highly 
correlated is addressed in [8]. This work explored 
several different overall optimisation levels, as well 
as four specific optimisations, using the Wattch 
simulator [9] to estimate energy results. However, the 
specific optimisations were all applied individually on 
top of the first optimisation level, without exploring 
any possible interactions between the optimisations. 
The main conclusion drawn from this study was that 
most optimisations reduce the number of instructions 
executed, hence reducing energy consumption and 
execution time simultaneously. 

Of the studies that look at individual optimisations 
and their effects on energy, most focus on only 
a few optimisations in isolation and few consider 
multiple platforms with different architectural features. 
Commonly explored optimisations, such as loop 
unrolling [10], loop fusion [11], function inlining [12] 
and instruction scheduling [13], have been examined 
extensively for different platforms using both simulators 
and hardware measurements [14]. 

A drawback of the studies that do explore 
energy consumption is that many of them choose 
to use simulators as opposed to taking hardware 
measurements. The Wattch simulator [9] is commonly 
used and is designed to allow easy energy measurements 
while exploring architectural configurations. The 
accuracy of Wattch is established as being within 10% of 
an industry layout-level power tool. However, Wattch 
does not model every hardware component in the 
processor, which makes it difficult to be certain about 
the total energy consumption of the processor. 

SimplePower [15] is another simulator that has been 
used to explore the energy consumption of the software 
running on a processor. This simulator targets a 
five stage RISC pipeline, with energy consumption 
estimates based on the number of transitions on bus 
signal lines as well as various other components. 

Various other models have been created to simulate 
power consumption of the processor, including complex 
instruction level models [16], function-level models [17] 
and hybrids of these [18]. However, these all suffer the 
drawback that some energy consumption effects may 
not be modelled, potentially skewing the results. 



3.2. Optimisations Targeting Energy 

Many previous studies look at how to utilise existing 
optimisations to target energy consumption. However, 
all of these optimisations were written with the aim 
of reducing execution time, not energy consumption. 
Several other techniques have been proposed to 
develop optimisations that specifically target energy 
consumption. 

An analysis of the techniques the compiler can 
perform to optimise for energy was carried out by 
Tiwari, Malik and Wolfe [19]. They identified several 
possible techniques that compilers could use to reduce 
the energy consumption of programs: 

• Reorder instructions to reduce switching. 

• Reduce switching on address lines. 

• Reduce memory accesses. 

• Improve cache hits. 

• Improve page hits. 

It is expected that optimisations covering some of 
these points will have an effect on energy. The last 
three will also normally increase performance as well as 
reduce energy. 

Several novel types of compiler optimisations have 
been proposed. Seth et al [20] explored the 
possibility of using the compiler to insert idle 
instructions automatically, increasing the execution 
time up to a set limit. Using the SIMD pipeline 
has been shown to decrease energy consumption [21] 
by roughly 25%. Scheduling instructions to minimise 
the inter-instruction energy cost was evaluated to be 
another effective method to reduce a program's energy 
consumption [22]. 

3.3. Optimisation Choice 

The challenge of choosing the optimisations and their 
order has been explored and many methodologies 
proposed for choosing an optimal set of optimisations. 

Chakrapani et al attempt to classify optimisations by 
the effect they have on performance and energy [23]. 
This work used both hardware measurements and a 
gate-level simulation to derive the results, separating 
the optimisations into the following three classes: 

• Reduction in energy consumption due to 
increase in performance. Optimisations in this 
class reduce the number of cycles or instructions 
needed to complete the application and thus less 
overall work is done. 

• Optimisations that reduce energy while 
not improving the performance. Scheduling 
instructions to reduce switching often falls into this 
category. 

• Optimisations that increase energy consumption 
or performance. 

Iterative compilation has been examined by Gheorghita 
et al [24] as a possibility for choosing optimisations 
that reduce power. In that paper, the effect 
of different loop unrolling and loop tiling parameters 
on energy consumption is examined for three linear- 
algebra-based benchmarks using a simulator. The 
paper concluded that iterative compilation was an 
effective method of decreasing energy consumption as 
well as improving performance. 

Other approaches have looked at genetic algorithms 
for optimisation selection [6] and optimisation phase 
ordering [25]. While those techniques are shown to 
be effective, they have the drawback that the reasons 
behind an optimisation's selection are not obvious. 

These techniques do not expose the relationships 
between optimisations, instead opting to search through 
the optimisation space and making a best guess 
where to look next. In this paper, we address 
these shortcomings using fractional factorial design [26] 
to explore the most effective optimisations and the 
interactions between them. 

Fractional factorial design, as a method for exploring 
the interactions of compiler optimisations, was discussed 
in [27]. Nine optimisations were examined, using a 
fractional factorial technique to isolate the interactions 
and choose a set of optimisations that gave better 
performance than just enabling all the optimisations. 

A similar study conducted by Patyk et al [28] 
extended this work to energy efficiency. The study 
explored a range of GCC's options, with an aim to 
reduce energy consumption by identifying significant 
optimisations, then excluding them from further 
exploration using fractional factorial design. We use 
this technique to analyse the optimisations rather than 
optimise for the energy consumption. 

These methods all require testing over many different 
compilations, which is a significant overhead when 
finding an optimal set. The MILEPOST GCC [29] 
study implemented an alternative to this, using machine 
learning to guess which optimisation flags would best 
apply to a given program. Features are extracted from 
the program, which are then matched against 
known results from previous compilations. This allows 
a set of optimisations to be estimated from just the 
source code. 

The majority of the studies listed in this section 
only examine one platform, and it is currently 
unknown whether their results would apply across 
several different platforms. Furthermore, iterative 
compilation [7] and other adaptive techniques used can 
leave holes of potential combinations of optimisations 
unexplored (due to the huge numbers of combinations 
possible). This can lead to the best 
configurations not being found. 

4. APPROACH 

In this paper we present an improved technique for 
testing the effectiveness of large numbers of compiler 
optimisations and their impact on energy consumption 
and run-times. The technique is based on the concept 
of fractional factorial design (see Sect. 7). 

SHA    MiBench    Memory, integer 

TABLE 2. Benchmarks selected, and the types of 
instructions they execute intensively (FP short for Floating 
Point). 

4.1. A New Benchmark Set 

To explore the impact of the optimisations, a realistic 
set of test input programs is required. There have been 
many attempts to find representative programs, but 
none apply to a wide range of embedded processing 
systems. As a result, we have derived a set of 
benchmarks from contemporary suites. This set of 
10 benchmarks, shown in Tab. 2, covers real world 
and synthetic applications across different aspects 
of the target platform. These are selected from 
MiBench [30] and the Worst Case Execution Time 
(WCET) [31] suites. Previous work on modelling the 
energy consumption of processors has shown that the 
pipelines and functional units enabled have a significant 
impact on the energy consumption. To cover these 
points, the benchmarks were characterised according to 
the following coarse criteria: 

• Integer pipeline intensity. The frequency at 
which integer arithmetic instructions occur. 

• Floating point intensity. The frequency of 
floating point operations. 

• Memory intensity. Whether the program 
requires a large amount of memory bandwidth or 
not. 

• Branch frequency. How often the code branches. 

Similar categories of instruction types have been used 
previously to give a high level overview of the type of 
computation an application is performing [32]. Our 
categories group similar instructions, such as the loads 
and stores in MiBench, since energy consumption is 
predominately related to the target functional unit, 
rather than the specific operation. 

This set of benchmarks is chosen because they do 
not require a host operating system. This prevents the 
benchmark from being pre-empted by another process, 
which would reduce the accuracy of the results, and makes 
the execution of the benchmarks deterministic. For the 
same reasons, the benchmarks do not perform any I/O. 

FIGURE 2. Energy, time and power results for benchmark-platform combinations. Optimisation levels O0 to O4. O4 is O3 
with link-time optimisation. The last point is Os (optimise for space). Some results are unavailable where the compiler 
crashed while producing the output binary. 



The benchmarks are also chosen carefully with 
regard to cache performance. Reducing memory 
accesses and optimising the code for cache performance 
is a well known way of decreasing execution time 
and energy consumption. In this study we do not 
consider cache performance, since the majority of 
deeply embedded systems do not have a cache hierarchy 
and perform memory accesses in a single cycle. We 
address this issue by choosing benchmarks that are 
small enough to fit into the L1 cache of the only 
platform with caches (Cortex-A8). With 'warm-start' 
measurements this means the cache acts as a single cycle 
memory, making it comparable in predictability to the 
other platforms. 

4.2. Compiler Flags 

We explore the impact of compiler optimisations using 
the GCC toolchain on the architectures shown in Tab. 1. 
GCC exposes its various optimisations via a number 
of flags that can be passed to the compiler [33]. We 
explore which flags have a significant impact on 
energy consumption and execution time. 

The experiments are performed with different 
benchmarks, so a complete picture of architecture, 
optimisation and application can be seen. By taking 
this combination, the following points of interest can 
be explored: 

• The relationship between time and energy; 

• Architectural effects on energy consumption; and 

• Application effects on energy consumption. 

By using the techniques we have just outlined, we can 
rigorously evaluate our hypotheses, answering questions 
about the relationship between time and energy, and 
optimisation choice. 

5. TIME AND ENERGY 

This section addresses the first hypothesis, and 
shows that energy consumption and execution time are 
correlated across a significant number of benchmarks 
and platforms. A high level overview of each platform 
and benchmark for the different optimisation levels is 
given in Fig. 2. This figure shows a line graph for 
each combination, displaying the effect of the broad 
optimisation levels O1, O2, O3, O4 (defined as O3 with 
link time optimisation) and Os (optimise for space) on 
time, energy and power. 

J. Pallister and S. Hollis and J. Bennett 

[Plot: Energy and Time series per flag, grouped as "O1 Flags (-)" and "O2 Flags (+)".] 

FIGURE 3. Blowfish benchmark on the Cortex-M3 platform. 
Individual options are enabled or disabled on top of the O1 
optimisation level. 

For the Cortex-M0, very little difference between 
energy and time is seen because it is the simplest 
processor tested: it has a three stage pipeline without 
forwarding logic. The pipeline behaviour is simple, only 
stalling if it encounters a load or a branch, thus it is 
not sensitive to specific code sequences. The Cortex-M3 
exhibits very similar behaviour, with some very 
slight differences between energy and time. The 
micro-architecture in this processor is more complex, featuring 
branch speculation and a larger instruction set [34]. 

The XMOS processor has a four stage pipeline, similar 
to the Cortex-M3 in complexity and performance. 
It should also be noted that the compiler for the XMOS 
processor uses an LLVM backend [35] for code generation, 
featuring different optimisations. Due to this the 
result set for this processor is not as extensive as the 
other four, but is still broadly comparable. 

The Epiphany processor also sees a large correlation 
between the energy consumption and execution time. 
There is some divergence when the superscalar core in 
the processor is able to dispatch multiple instructions 
simultaneously. This gives the compiler more potential 
for creating advantageous code sequences. 

The greatest difference between energy and time 
was discovered while using the Cortex-A8. For the 
majority of the benchmarks the execution time reduces 
more than the energy. This is due to multiple 
instructions being executed simultaneously by the core, 
reducing the amount of time taken but not the energy 
consumption, as the same total work is still being 
done. We infer from this that the amount of pipeline 
activity has a significant measurable effect on the energy 
consumption. The gap is also seen to widen at the O2 
level, due to instruction scheduling being enabled there. 

These results support our first hypothesis that time 
and energy are broadly correlated. The strongest 
correlation occurs in the qualitatively 'simplest' 
pipelines. Increasing pipeline complexity means there 
are more opportunities for architectural energy saving 
measures (clock gating, etc.) making the complex 
processor's energy profile more variable and improving 
the potential for compiler optimisation impact. 
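
The first hypothesis can be checked numerically by correlating execution time with energy, where energy is average power multiplied by time. The following is a minimal sketch only: the measurement values are invented for illustration and the helper name `pearson` is an assumption, not part of the original study.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

# Hypothetical measurements for one benchmark at six optimisation levels.
time_s = [2.00, 1.40, 1.10, 1.05, 1.00, 1.50]
power_w = [0.050, 0.051, 0.052, 0.052, 0.053, 0.050]

# Energy (J) is average power (W) multiplied by execution time (s).
energy_j = [t * p for t, p in zip(time_s, power_w)]

r = pearson(time_s, energy_j)
print(f"correlation between time and energy: {r:.4f}")
```

When average power varies little between optimisation levels, as in this invented data, energy tracks time almost exactly and the coefficient is close to 1. Larger swings in power, such as those reported for the Cortex-A8, would pull the coefficient down.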

6. OPTIMISATION POTENTIAL 

The second hypothesis we wanted to explore was that 
it was possible to find a set of optimisations that 
performed better than the standard optimisation levels. 
Fig. 3 shows each option in the O1 and O2 optimisation 
levels enabled or disabled on top of the flags in O1. By examining 
the left of the graph, it can be seen that by disabling 
-fguess-branch-probability (in this specific run) 
the energy decreases by 4%. This shows that there exists a set of 
optimisations that performs better than the predefined 
O1 optimisation level. 
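
One way to set up this kind of experiment is to generate one compiler invocation per flag, each disabling a single option on top of a base level. The sketch below is an illustration under assumptions: the driver name `gcc` stands in for whichever cross-compiler a platform needs, the helper name is hypothetical, and the flag list is just three options named in this paper. It relies only on GCC's documented convention that `-fno-<name>` is the negative form of `-f<name>`.

```python
def single_flag_variants(base_level, flags, driver="gcc"):
    """Yield (flag, argument list) pairs, each disabling one flag on top of base_level."""
    for flag in flags:
        # GCC uses -fno-<name> as the negative form of -f<name>.
        negated = "-fno-" + flag[len("-f"):]
        yield flag, [driver, base_level, negated]

# Three flags mentioned in the text; not the full set enabled by any level.
flags_to_toggle = [
    "-fguess-branch-probability",
    "-fomit-frame-pointer",
    "-fdce",
]

variants = dict(single_flag_variants("-O1", flags_to_toggle))
for flag, args in variants.items():
    print(flag, "->", " ".join(args))
```

Each argument list would still need the source files and output path appended before being passed to a build step. Toggling one flag at a time in this way cannot reveal interactions between flags, which is the limitation the fractional factorial design addresses.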

This conclusion is in line with much of the 
related work, which has focused on choosing a set of 
optimisations which is better than the standard 
optimisation levels for a given benchmark. 

FIGURE 4. Reducing a 3-factor full factorial design to a 
'half fraction' design. 

7. FRACTIONAL FACTORIAL DESIGN 

This section explores the third hypothesis: a method 
to systematically explore the optimisation space. 

GCC has over 150 different options that can be 
enabled to control optimisations. The majority of these 
options are binary: the optimisation pass is either 
enabled or disabled. To further complicate matters, 
an optimisation pass may be affected by other passes 
happening before it. It is not feasible to test all possible 
combinations of options, therefore a trade-off has to 
be made. One of our main contributions is to deploy 
fractional factorial design [26] (FFD) to massively 
reduce the number of tests needed to explore the space, whilst 
still identifying the options that contribute to run-time 
and energy. This approach has been explored on a small 
scale in [27], where nine optimisations were explored in 
just 35 tests as opposed to the 512 required for a full 
factorial design. 

An example full factorial design is shown on the left 
of Fig. 4. This example shows three factors with every 
possible combination enumerated. A fractional factorial 
design with the number of tests halved is shown on the 
right, yet still allows the difference between any two 
factors to be estimated. 
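
The reduction from a full factorial design to a half fraction can be sketched in a few lines. This is an illustrative construction, assuming the common choice of setting the last factor to the product of the others (C = AB for three factors); the half fraction drawn in Fig. 4 may use the complementary half, and the function names are hypothetical.

```python
from itertools import product

def full_factorial(k):
    """All 2**k runs, each factor coded as -1 (disabled) or +1 (enabled)."""
    return [list(run) for run in product((-1, 1), repeat=k)]

def half_fraction(k):
    """2**(k-1) runs: the last factor is set to the product of the others."""
    runs = []
    for base in product((-1, 1), repeat=k - 1):
        last = 1
        for level in base:
            last *= level
        runs.append(list(base) + [last])
    return runs

full = full_factorial(3)   # 8 runs
half = half_fraction(3)    # 4 runs

for run in half:
    print(run)
```

Every factor is still enabled in exactly half of the retained runs, which is what allows its main effect to be estimated from the smaller experiment. The same counting explains the figure quoted above: nine binary options give 2**9 = 512 runs in a full design.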

The drawback to this approach is that the high- 
order interactions between options (effects due to 
multiple options being enabled) will not be discernible. 
Fortunately this is not usually a problem as these types 
of interactions are statistically rare. The degree to 
which this happens is specified by the FFD's resolution. 
A resolution 5 design ensures that the main effects 
are not aliased with anything lower than 4th order 
interactions. 
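
The link between resolution and aliasing can be made concrete by treating each effect as a set of factor letters, where multiplying two effects cancels any letter they share. The sketch below uses standard design-of-experiments definitions; the function names and the example generator words are illustrative assumptions, not the designs used in this study.

```python
from itertools import combinations

def multiply(word_a, word_b):
    """Product of two effects: letters present in both cancel out."""
    return frozenset(word_a) ^ frozenset(word_b)

def defining_relation(generators):
    """All non-identity products of the generator words."""
    words = set()
    for size in range(1, len(generators) + 1):
        for combo in combinations(generators, size):
            word = frozenset()
            for generator in combo:
                word = multiply(word, generator)
            if word:
                words.add(word)
    return words

def resolution(generators):
    """Resolution is the length of the shortest word in the defining relation."""
    return min(len(word) for word in defining_relation(generators))

def aliases(effect, generators):
    """Effects that cannot be separated from `effect` in this design."""
    return {"".join(sorted(multiply(effect, word)))
            for word in defining_relation(generators)}

# Half fraction of three factors, defining relation I = ABC.
print(resolution(["ABC"]), aliases("A", ["ABC"]))
# Half fraction of five factors, defining relation I = ABCDE.
print(resolution(["ABCDE"]), aliases("A", ["ABCDE"]))
```

In the five-factor example the design has resolution 5 and the main effect A is aliased only with the four-factor interaction BCDE, matching the statement above. In the three-factor half fraction, A is aliased with the two-factor interaction BC, so the design has only resolution 3.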

Using the Yates algorithm [26] the effect of any 
single factor or combination of factors can be found from 
the data. This gives an estimate for how much this 
factor or interaction affects the result of the experiment. 
The Mann-Whitney statistical test is used to determine 
whether the factor represents a significant change in 
performance as detailed in [28] and [5]. 
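
As an illustration of the effect calculation, the following is a minimal sketch of the Yates algorithm for a full two-level factorial whose responses are listed in standard order ((1), a, b, ab, c, ac, ...). It is not the authors' implementation, and the response values are invented. For a fractional design the same procedure is normally applied to the underlying base design, with each estimate then standing for an effect together with its aliases.

```python
def yates(responses):
    """Yates algorithm for a 2**k full factorial in standard order.

    Returns contrasts in standard order: total, A, B, AB, C, AC, BC, ABC, ...
    """
    n = len(responses)
    k = n.bit_length() - 1
    if n != 2 ** k:
        raise ValueError("number of responses must be a power of two")
    column = list(responses)
    for _ in range(k):
        # Pairwise sums fill the top half, pairwise differences the bottom half.
        sums = [column[i] + column[i + 1] for i in range(0, n, 2)]
        diffs = [column[i + 1] - column[i] for i in range(0, n, 2)]
        column = sums + diffs
    return column

def effects(responses):
    """Effect estimates: each contrast divided by half the number of runs."""
    contrasts = yates(responses)
    half = len(responses) / 2
    return [c / half for c in contrasts[1:]]

# Invented energy readings for three flags A, B, C: enabling A saves 3 units,
# enabling C costs 1 unit, and B has no effect.
readings = [20, 17, 20, 17, 21, 18, 21, 18]
labels = ["A", "B", "AB", "C", "AC", "BC", "ABC"]
for label, value in zip(labels, effects(readings)):
    print(label, value)
```

With these readings the estimated effects are -3 for A, 1 for C and 0 for everything else, recovering the values used to construct the data.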

All FFDs used were generated by the statistical 
program R [36] (a statistical programming language), 
using the FrF2 library [37]. 

FIGURE 5. Blowfish benchmark on the Cortex-M0 
platform. Individual options enabled at O1 are listed. 

FIGURE 6. FDCT benchmark on the Cortex-M3 
platform. Individual options enabled at O2 are listed. 

7.1. FFD Results 

The results from the FFD experiments provide 
additional evidence to back up the first hypothesis, of 
execution time and energy being correlated. 

Results showing the correlation between time and 
energy are shown in Fig. 5. This shows the main 
effect each optimisation has on the runtime and energy 
consumption, as calculated by the FFD. As these results 
come from a total of 2048 separate runs, even a small 
percentage change is statistically significant. This 
significance is calculated using the Mann-Whitney test. 
The bracket above the bars indicates when the result 
satisfies the following hypothesis: it is 95% certain that 
the result represents a significant impact on the energy 
consumption of the benchmark. 

Fig. 6 highlights a discrepancy that occurred between 
execution time and energy consumption, even for very 
similar optimisations. The first two options listed 
(-fschedule-insns and -fschedule-insns2) both 
schedule instructions to reduce pipeline stalls. However 
the latter option performs its scheduling pass after 
register allocation, whereas the first performs it before. 
In this case scheduling before the register allocation 
reduces the energy consumption by much more than 
the execution time. 

7.2. Efficient SIMD Units 

In this section we analyse a specific case when energy 
consumption and execution time are not correlated. 

An interesting effect is seen in 2D FIR for the 
Cortex-A8. The execution time decreases more than 
the energy consumption up to O2, but when enabling 
O3 the energy decreases proportionally more than the 
execution time. On further investigation, this is 
caused by the -ftree-vectorize optimisation having 
an impact on energy consumption with no change in 
execution time (shown in Fig. 7). This option vectorizes 
loops, so that SIMD instructions can be inserted. We do 
not see a performance boost due to the structure of the 
Cortex-A8 pipeline, where it is expensive to copy results 
between the NEON unit and the standard registers. 

Further investigation of the NEON SIMD unit was 
done using some simple tests consisting of executing a 
single instruction many times. The results of these are 
shown in Tab. 3, showing for multiplication the NEON 
unit uses around 20% less power than using the normal 
Cortex-A8 multiplier. This is in line with what previous 
studies have found [21] and shows that by using the 
hardware to its full capacity, the greatest energy savings 
can be achieved. 

FIGURE 7. 2D FIR benchmark on the Cortex-A8 
platform. Individual options enabled at O3 are listed. 

NEON  Instruction Dependencies  Continuous Power Consumption
No    Yes                       168 mW
No    No                        195 mW
Yes   Yes                       158 mW
Yes   No                        159 mW

TABLE 3. Micro-benchmark results for multiplications 
on the NEON unit, with and without inter-instruction 
dependencies. 


8. THE UNIVERSALITY OF FLAGS 

We have seen large variations based on optimisation 
flags, and so an interesting problem for compiler 
designers is how to choose an optimal set of flags across 
different hardware platforms and applications. This 
section explores which individual flags had the largest 
effect in our experiments and our fourth hypothesis: 
that a consistently good optimisation is not seen across 
all benchmarks and platforms. Tab. 4 lists the results 
for this section, with the top three optimisation flags 
(where that optimisation has a significant effect, as per 
the Mann-Whitney test) identified for each benchmark 
and platform combination. Each letter represents an 
optimisation that is labelled in the table below. We 
also show the number of times this flag occurs. 
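
The significance filter can be sketched as a two-sided Mann-Whitney U test comparing the runs with a flag enabled against the runs with it disabled. The version below is a simplified illustration that uses the normal approximation and applies no tie correction to the variance, so its p-values are approximate; the sample data are invented and the function name is an assumption.

```python
import math

def mann_whitney_u(sample_a, sample_b):
    """Two-sided Mann-Whitney U test using the normal approximation.

    Returns (U statistic for sample_a, approximate p-value).
    """
    combined = sorted([(v, 0) for v in sample_a] + [(v, 1) for v in sample_b])
    # Assign average ranks to tied values.
    ranks = [0.0] * len(combined)
    i = 0
    while i < len(combined):
        j = i
        while j + 1 < len(combined) and combined[j + 1][0] == combined[i][0]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for idx in range(i, j + 1):
            ranks[idx] = average_rank
        i = j + 1
    n_a, n_b = len(sample_a), len(sample_b)
    rank_sum_a = sum(r for r, (_, group) in zip(ranks, combined) if group == 0)
    u_a = rank_sum_a - n_a * (n_a + 1) / 2
    mean_u = n_a * n_b / 2
    sd_u = math.sqrt(n_a * n_b * (n_a + n_b + 1) / 12)
    z = (u_a - mean_u) / sd_u
    # Two-sided tail probability of a standard normal variable.
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return u_a, p_value

# Invented energy readings (mJ) with a flag enabled and disabled.
enabled = [41.0, 40.5, 40.8, 41.2, 40.9, 40.7]
disabled = [42.1, 42.4, 41.9, 42.0, 42.3, 42.2]
u, p = mann_whitney_u(enabled, disabled)
print(f"U = {u}, p = {p:.4f}, significant at 95%: {p < 0.05}")
```

A flag would be reported as significant at the 95% level when the p-value falls below 0.05. For small samples an exact test is preferable to the normal approximation used here.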

Only 20 of the 82 options examined (the number of flags 
enabled by O1, O2 and O3) appear in the table. This 
supports the argument that many of the options have 
little effect on the energy consumption and consequently 
performance. 

For the ARM platforms, a similar set of options 
appears for the same benchmarks. Common options for 
the same benchmarks are expected, since optimisations 
are triggered by the structure of the source code. 
However, the opposite of this is seen for the Epiphany 
processor — there are three optimisations that are 
consistently effective at reducing energy. A particularly 
unusual option to be consistently effective is -fdce: 
dead code elimination, removing code which is never 
used by the application. However, this also allows the 
compiler to eliminate parts of the control flow graph, 
removing branches and decreasing the amount of work 
the application performs. 

The optimisation listed most frequently in the table 
is -ftree-dominator-opts. The prevalence of this flag 
is likely due to it enabling several simple optimisation 
passes, performing optimisations such as copy propagation 
and expression simplification. Another effective 
optimisation is -fomit-frame-pointer, which frees an 
additional register for general use by not using a frame 
pointer. This optimisation is seen frequently on the ARM 
platforms but not at all on the Epiphany. This is likely 
due to the ARM processors suffering from greater register 
pressure, since they only have 16 registers compared to 
the Epiphany's 64. 

[Table 4 body: for each benchmark (2dfir, blowfish, crc32, cubic, 
dijkstra, fdct, float_matmult, int_matmult, rijndael, sha) and each 
platform (Cortex-M0, Cortex-M3, Cortex-A8, Epiphany), the 1st, 2nd 
and 3rd most effective flags, given by ID. The individual cell 
entries are not recoverable.] 

ID  Count  Flag                         ID  Count  Flag 
A   12     -ftree-dominator-opts        K   5      -fschedule-insns 
B   12     -ftree-loop-optimise         L   3      -finline-small-functions 
C   11     -fomit-frame-pointer         M   3      -fschedule-insns2 
D   8      -fdce                        N   3      -ftree-pre 
E   7      -fguess-branch-probability   O   2      -ftree-sra 
F   7      -fmove-loop-invariants       P   1      -fipa-profile 
G   7      -ftree-ter                   Q   1      -ftree-pta 
H   7      -ftree-fre                   R   1      -fcombine-stack-adjustments 
I   6      -ftree-ch                    S   1      -fgcse 
J   5      -ftree-forwprop              T   1      -fpeephole2 

TABLE 4. Table showing the most effective option for each 
platform-benchmark combination. Options considered were 
optimisations enabled by O1, O2 and O3 levels. 



We see some interesting correlations between platforms. 
The CRC32 benchmark does not have much optimisation 
potential since it consists of simple operations in a 
tight loop. We indeed observe that very few optimisations 
have a significant effect. Only one common option 
(-fmove-loop-invariants) appears across three of the four 
platforms. This optimisation moves redundant, 
loop-invariant calculations out of loops. 

As observed, some options affect the energy consumption 
across multiple benchmarks. This is because the 
optimisations target specific code patterns, which 
appear only in some of the benchmarks. In particular we 
see effective options across different ARM platforms, 
while these same optimisations are not effective on the 
Epiphany. Since each of the ARM platforms uses a slightly 
different instruction set (Thumb, Thumb+Thumb2 and ARM 
for the Cortex-M0, Cortex-M3 and Cortex-A8 respectively), 
we infer that the effectiveness of these options is due 
to commonalities between these instruction sets, namely 
the number of registers. 

These results show the difficulty of choosing one 
optimisation which is good in all cases. In many 
cases the instruction set and micro-architecture of the 
processor have a large effect on how much the energy 
consumption is reduced. This means that a single 
universally good optimisation cannot be chosen. 



8.1. Optimisation Chaos 

In this section, we expand on there being no universally 
good optimisation by investigating the effect of 
interactions between optimisations. We conclude that 
there is a chaotic relationship between the platform, 
benchmark, and the effectiveness of the optimisation. 

Examining the correlation between optimisations and 
their effects is a complex issue. Due to non-linear 
interactions, one would expect that the prediction of 
effects is difficult. This is borne out by our experimental 
results: as seen in previous Figs. 5 and 6, less than 
a third of the options have a significant impact. For 
the other optimisations, higher order interactions cause 
unpredictable effects, where enabling or disabling a 
particular optimisation can completely change the effect 
of many other subsequent optimisation passes. 

In Fig. 3, several unexpected effects worthy of further 
investigation can be seen. This graph shows individual 
optimisations being turned on and off, using the O1 
optimisation level as a base. The flags on the left 
of the O1 section were found to decrease the energy 
consumption when disabled, an effect not seen in the 
FFD results. These flags were chosen for further 
exploration. 

To explore this inconsistency, a small case study 
was performed, where all combinations of four options 
were explored. The energy figures for exhaustive 
exploration can be seen in Tab. 5, with the aim being 
to ascertain whether the effect of this energy reduction 
would compound with multiple flags. The O1 column of 
this table shows the results of the options applied over 
the O1 optimisation level. The O2 column shows the 
same but on top of the O2 optimisation level. 

X1  X2  X3  X4   Abs (mJ)      O1 (%)   Abs (mJ)      O2 (%) 
[earlier rows are missing from the source] 
✗   ✓   ✓   ✗    [illegible]    1.40    [illegible]   [illegible] 
✗   ✗   ✓   ✗    5960           3.08    5480           0.00 
✓   ✓   ✗   ✗    5890           1.91    5470          -0.19 
✗   ✓   ✗   ✗    5870           1.61    5570           1.57 
✓   ✗   ✗   ✗    5690          -1.56    5480          -0.03 
✗   ✗   ✗   ✗    5880           1.81    5510           0.41 

TABLE 5. Exhaustively exploring 4 options compared to 
O1 and O2. (Cortex-M3 with blowfish benchmark). Legend 
in Tab. 6. 

Key        Option 
X1         -fguess-branch-probability 
X2         -ftree-dominator-opts 
X3         -ftree-ch 
X4         -fif-conversion 
Abs (mJ)   Absolute energy measurement in millijoules. 
O1 (%)     Percentage relative to O1. 
O2 (%)     Percentage relative to O2. 
✓          Optimisation is enabled. 
✗          Optimisation is disabled. 

TABLE 6. Legend for Tab. 5. 
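
The exhaustive exploration of the four options can be sketched as follows. This is a minimal Python illustration that only builds the sixteen GCC argument lists; the function name is hypothetical, and invoking the compiler and measuring energy are left out because the tooling used in the study is not described here.

```python
from itertools import product

# The four options from the case study, without the -f / -fno- prefix.
FLAGS = ["guess-branch-probability", "tree-dominator-opts",
         "tree-ch", "if-conversion"]

def flag_combinations(base_level, flags):
    """Yield one GCC argument list per on/off combination of `flags`."""
    for states in product((True, False), repeat=len(flags)):
        args = [base_level]
        for name, enabled in zip(flags, states):
            # GCC disables an -fNAME option with -fno-NAME.
            args.append(("-f" if enabled else "-fno-") + name)
        yield args

combos_o1 = list(flag_combinations("-O1", FLAGS))
combos_o2 = list(flag_combinations("-O2", FLAGS))
```

Each list in combos_o1 corresponds to one row of the O1 column and each list in combos_o2 to one row of the O2 column.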

From the O1 column, it can be seen that 
there are many interactions occurring between the 
options, as simply turning all of these options off 
does not decrease the energy (in fact it increases 
the consumption by 1.81%). Furthermore, when 
disabled individually, -fguess-branch-probability 
and -ftree-dominator-opts decrease the energy by 
2.49% and 1.76% respectively. However, when both are 
disabled, the energy consumption (relative to O1) is only 
0.93% less, worse than each flag individually. 
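
The size of this interaction can be made explicit by comparing the combined result with what a purely additive model would predict. The sketch below uses the three percentages quoted above; the function name is hypothetical, and the additive model is an assumption used only as a reference point.

```python
def interaction(effect_a, effect_b, effect_both):
    """Deviation of the combined effect from the additive prediction."""
    return effect_both - (effect_a + effect_b)

# Energy change relative to O1, in percent (negative means less energy).
guess_branch = -2.49     # -fguess-branch-probability changed alone
dominator_opts = -1.76   # -ftree-dominator-opts changed alone
combined = -0.93         # both flags changed together

additive = guess_branch + dominator_opts
gap = interaction(guess_branch, dominator_opts, combined)
```

If the two effects were independent the combined saving would be about 4.25%; the observed saving falls short of that by roughly 3.3 percentage points.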

Entirely different results are seen in the O2 column: 
options that decreased energy consumption on top 
of O1 have little or the opposite effect when applied on 
top of O2. 

This unpredictability suggests that these options 
have many interdependencies that are difficult to 
predict up front. It also makes choosing an optimal 
set of optimisations very challenging. Therefore, one 
of our findings is that it is very unlikely any accurate 
prediction mechanism for considering an optimisation 
and its effect on a target system exists. The effect will 
always be highly dependent on the application to be 
used and the platform upon which it resides. 

9. WHAT DOES THIS MEAN FOR THE 
APPLICATION DEVELOPER? 

The existing collections of optimisations at the various 
levels do a good job of optimising for performance, 
and consequently, energy. These strike a good balance 
between ease of use and performance. However, they 
will never be as effective as those generated by searching 
through the full optimisation space. To avoid running 
many tests to find a good solution, developing 
machine-learning compiler technologies similar to 
MILEPOST GCC [29] would be fitting. A reasonable set of 
optimisations could be predicted based on high-level 
features and an architecture selection, and this 
would greatly reduce the time spent searching. This 
is especially true as the effectiveness and type of 
optimisation was found to be heavily based on the 
platform and the structure of the application being 
compiled. Predicting the optimisations in this way 
would reduce compile times as well as the energy and 
execution time of the application. 

This study focused on GCC, since it is a 
mature compiler supporting many different platforms 
and optimisations. As an alternative, the LLVM 
compiler [35] is relatively new, with a well-defined 
set of optimisation passes, whose order can easily be 
specified. This extra flexibility means there may be 
a better solution to find, but also that it is essentially 
searching for a sharper needle in a bigger haystack. The 
benefits from having this much larger space to explore 
may not be worth the trade-off of the time it takes to 
find it. Therefore, it is unlikely that we will ever be able 
to attain the goal of easy optimisation for run-time or 
energy, and will remain constrained to sub-optimal but 
balanced sets of flags such as O3 or longer search-based 
methods. 

10. WHAT DOES THIS MEAN FOR THE 
COMPILER WRITER? 

When designing a new optimisation, a compiler writer 
must check whether the optimisation is effective, and 
under what conditions. Using fractional factorial 
design a compiler writer can check whether the pass 
is effective when combined with an arbitrary set of 
other optimisations. This avoids the case of the 
optimisation being tested in isolation, which will result 
in an incorrect analysis because of the interactions 
between optimisations. We would recommend that, 
when selecting optimisations to be included in a broad 
optimisation level, the optimisation is evaluated in this 
way and only selected if it has a non-negative effect over 
all of the benchmarks. 
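
To make the procedure concrete, the following Python sketch builds a small two-level fractional factorial design and estimates the main effect of each factor. The half-fraction with generator C = AB, the function names and the synthetic responses are assumptions chosen for illustration; the designs used in this study cover far more options and are not reproduced here.

```python
from itertools import product

def fractional_factorial(base, generators):
    """Build a two-level design as a list of {factor: -1 or +1} runs.

    `base` factors take every -1/+1 combination; each generated
    factor is the product of the base factors named in `generators`.
    """
    runs = []
    for levels in product((-1, 1), repeat=len(base)):
        run = dict(zip(base, levels))
        for name, parents in generators.items():
            value = 1
            for parent in parents:
                value *= run[parent]
            run[name] = value
        runs.append(run)
    return runs

def main_effect(runs, responses, factor):
    """Mean response with the factor on minus mean response with it off."""
    high = [y for r, y in zip(runs, responses) if r[factor] == 1]
    low = [y for r, y in zip(runs, responses) if r[factor] == -1]
    return sum(high) / len(high) - sum(low) / len(low)

# Three hypothetical passes tested in 4 runs instead of 8.
runs = fractional_factorial(["A", "B"], {"C": ("A", "B")})

# Synthetic energy figures in which only pass A matters.
responses = [10.0 - 2.0 * r["A"] for r in runs]
effects = {f: main_effect(runs, responses, f) for f in ("A", "B", "C")}
```

In this half-fraction the main effect of C is aliased with the A-B interaction, which is the price paid for halving the number of runs.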

All the optimisations we show in this paper are 
designed for either performance or code size. This 
means we cannot draw conclusions about the effect of 
dedicated compiler optimisations targeting energy such 
as those shown in the related work (Sect. 3.2). Although 
all optimisation targets may be beneficial for energy 
usage, dedicated energy flags would have to compete 
against these other optimisation metrics, meaning that 
even if they operate well in isolation, they may not do 
well when grouped. There are many opportunities for 
further work in this area. 

11. CONCLUSION 

The first hypothesis of energy consumption and 
execution time being correlated in the general case 
was found to be correct across many platforms and 
benchmarks. This was first shown to be true by the 
high level results, showing only the overall optimisation 
levels. The more detailed fractional factorial design 
runs also demonstrated this result, showing that most 
optimisations had the same relative effect on energy 
and time. This result occurs because the majority of 
optimisations focus on reducing the total amount of 
work performed by the benchmarks, thus minimising 
both energy consumption and execution time. 

By adding and subtracting individual flags on top 
of the whole optimisation levels we have shown that 
a better set of flags exists, which can produce more 
efficient applications. This validates our second 
hypothesis, giving results in line with much previous 
work. 

The third hypothesis stated that it was possible to 
efficiently search the optimisation space to gain 
information about the effectiveness of each optimisation. 
To perform this we leveraged fractional factorial designs, 
allowing us to test each optimisation in a greatly 
reduced number of runs. This method allowed us to 
explore complex effects seen on the Cortex-A8, where the 
SIMD unit helped achieve a lower energy consumption. 

The fourth hypothesis of there being no optimisation 
which was effective for all benchmarks and platforms 
was evaluated using the fractional factorial designs. We 
were able to extract the most effective optimisations 
for each benchmark and platform pair and these 
results showed that there was no single optimisation 
that was universally effective. Further analysis of 
adding and subtracting individual flags showed that 
the optimisation space is chaotic, with optimisations 
interacting in unpredictable ways. 

The compiler writer can use these results and the 
fractional factorial design method to evaluate potential 
optimisation passes, ensuring that they perform well 
in a variety of configurations. Until a method for 
resolving the interactions between optimisations is 
found, it is envisioned that the developer could use this 
technique to eliminate optimisations that are not having 
a positive effect on their application. This will speed up 
compilation time as well as potentially improving the 
performance of their application. 

APPENDIX A. HARDWARE SETUP 

All the measurements were taken using the INA219 
power monitoring IC [38], which provides power, 
current and voltage outputs. 

The Cortex-M0 and Cortex-M3 boards both have 
a single measurement point, recording the power 
consumed by the whole microprocessor. For the 
BeagleBone there are three available measurement 
points: the Cortex-A8 core (including caches), on-chip 
peripherals (power management, bus controllers) 
and the external SDRAM memory IC. This allows the 
effect of the compiler optimisations on the memory 
to be recorded. Adapteva's Epiphany board has two 
measurement points: the core power consumption and 
IO power consumption, whereas the XMOS board's 
measurement point gathers power consumption data for 
the core of the processor. 

The hardware measurements have several sources of 
error. The most apparent errors are variations in the 
timing: the INA219 is sampled at intervals of 1 ms 
and the power measurement integrated over this. Small 
inaccuracies occur from jitter in this interval. The ADC 
in the INA219 also fluctuated by ±30 µV; however, this 
was close to the noise floor of the measurements, so had 
no significant effect on the results. 
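
The integration step can be sketched as a simple rectangle-rule sum over the sampled power values. This is a minimal Python illustration assuming power samples in milliwatts at a fixed 1 ms interval; the function name and sample data are hypothetical, and the exact integration scheme used by the measurement software is not specified here.

```python
def energy_mj(power_mw, interval_s=0.001):
    """Integrate power samples (mW) over fixed intervals; returns mJ.

    Each sample is assumed to hold for one full interval, so any
    jitter in the real sampling interval shows up as error here.
    """
    return sum(p * interval_s for p in power_mw)

# Hypothetical trace: a constant 100 mW for one second.
samples = [100.0] * 1000
total = energy_mj(samples)
```

With a constant 100 mW over one second the result is about 100 mJ, since milliwatts multiplied by seconds give millijoules.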